episodic reinforcement learning
Constrained episodic reinforcement learning in concave-convex and knapsack settings
Our approach relies on the principle of optimism under uncertainty to explore efficiently. Our learning algorithms optimize their actions with respect to a model based on the empirical statistics, while optimistically overestimating rewards and underestimating the resource consumption (i.e., overestimating the distance from the constraint).
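The optimism principle described above can be sketched as a confidence-bonus calculation: rewards get the bonus added, consumption gets it subtracted. This is a minimal illustrative sketch, not the paper's exact construction; the function name and the Hoeffding-style bonus are assumptions.

```python
import numpy as np

def optimistic_estimates(reward_sum, cost_sum, counts, t, delta=0.05):
    """Optimistic model estimates for a constrained exploration problem.

    Under-explored actions receive a larger confidence bonus, so they
    look both more rewarding and further from the resource constraint,
    which drives the algorithm to try them.
    """
    n = np.maximum(counts, 1)
    # Hoeffding-style confidence width (an assumed, generic choice).
    bonus = np.sqrt(np.log(2 * len(counts) * t / delta) / (2 * n))
    optimistic_reward = reward_sum / n + bonus               # overestimate rewards
    optimistic_cost = np.maximum(cost_sum / n - bonus, 0.0)  # underestimate consumption
    return optimistic_reward, optimistic_cost
```

The algorithm would then act greedily with respect to these optimistic quantities, subject to the (optimistically loosened) constraint.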
A Provably Efficient Model-Free Posterior Sampling Method for Episodic Reinforcement Learning
Thompson Sampling is one of the most effective methods for contextual bandits and has been generalized to posterior sampling for certain MDP settings. However, existing posterior sampling methods for reinforcement learning are either model-based or lack worst-case theoretical guarantees beyond linear MDPs. This paper proposes a new model-free formulation of posterior sampling that applies to more general episodic reinforcement learning problems with theoretical guarantees. We introduce novel proof techniques to show that under suitable conditions, the worst-case regret of our posterior sampling method matches the best known results for optimization-based methods. In the linear MDP setting, the regret of our algorithm scales linearly with the dimension, compared to the quadratic dependence of existing posterior sampling-based exploration algorithms.
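The posterior-sampling principle the paper generalizes is easiest to see in its classical bandit form: sample a model from the posterior, then act greedily with respect to the sample. The Beta-Bernoulli sketch below is an illustrative assumption, not the paper's MDP formulation.

```python
import numpy as np

def thompson_step(successes, failures, rng):
    """One step of Thompson Sampling for a Bernoulli bandit.

    Draws a mean-reward estimate for each arm from its Beta posterior
    (with a uniform Beta(1, 1) prior), then plays the arm whose sampled
    mean is largest. Randomness in the posterior draw supplies the
    exploration.
    """
    samples = rng.beta(successes + 1, failures + 1)
    return int(np.argmax(samples))
```

In the MDP generalization, the sampled quantity is (a representation of) the value function or model rather than per-arm means, but the sample-then-act-greedily structure is the same.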
Steady State Analysis of Episodic Reinforcement Learning
Reinforcement Learning (RL) tasks generally divide into two kinds: continual learning and episodic learning. The concept of steady state has played a foundational role in the continual setting, where a unique steady-state distribution is typically presumed to exist in the task being studied; this enables a principled conceptual framework as well as efficient data collection methods for continual RL algorithms. On the other hand, the concept of steady state has been widely considered irrelevant for episodic RL tasks, in which the decision process terminates in finite time. Alternative concepts, such as episode-wise visitation frequency, are used in episodic RL algorithms; these are not only inconsistent with their counterparts in continual RL, but also make it harder to design and analyze RL algorithms in the episodic setting. In this paper we prove that unique steady-state distributions pervasively exist in the learning environment of episodic learning tasks, and that the marginal distributions of the system state indeed approach the steady state in essentially all episodic tasks.
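The key observation, that an episodic task restarted at termination behaves like a recurrent chain, can be checked numerically: power-iterating the marginal state distribution drives it to a unique fixed point. The toy transition matrix below is an assumption for illustration only.

```python
import numpy as np

def stationary_distribution(P, iters=10_000):
    """Power-iterate the marginal state distribution of a Markov chain.

    Starting from a uniform marginal, repeatedly applying the transition
    matrix converges to the unique steady-state distribution when the
    chain is ergodic (irreducible and aperiodic).
    """
    mu = np.full(P.shape[0], 1.0 / P.shape[0])
    for _ in range(iters):
        mu = mu @ P
    return mu

# A 3-state episodic task: state 2 is terminal and resets to state 0.
# The reset edge makes the combined "learning environment" chain recurrent.
P = np.array([[0.0, 1.0, 0.0],
              [0.5, 0.0, 0.5],
              [1.0, 0.0, 0.0]])
```

Here the reset edge (terminal state back to the initial state) is exactly what restores ergodicity to a finite-horizon process.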
Beyond Value-Function Gaps: Improved Instance-Dependent Regret Bounds for Episodic Reinforcement Learning
We provide improved gap-dependent regret bounds for reinforcement learning in finite episodic Markov decision processes. Compared to prior work, our bounds depend on alternative definitions of gaps. These definitions are based on the insight that, in order to achieve a favorable regret, an algorithm does not need to learn how to behave optimally in states that are not reached by an optimal policy. We prove tighter upper regret bounds for optimistic algorithms and accompany them with new information-theoretic lower bounds for a large class of MDPs. Our results show that optimistic algorithms cannot achieve the information-theoretic lower bounds even in deterministic MDPs unless there is a unique optimal policy.
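For context, the classical value-function gap that these bounds refine is gap(s, a) = V*(s) - Q*(s, a). A minimal sketch of computing it from an optimal Q-table (the helper name is an assumption; the paper's alternative gap definitions additionally discount states an optimal policy never reaches):

```python
import numpy as np

def action_gaps(Q):
    """Compute gap(s, a) = V*(s) - Q*(s, a) from an optimal Q-table.

    V*(s) is the max over actions of Q*(s, a), so the gap measures how
    suboptimal each action is in each state; classical instance-dependent
    regret bounds scale inversely with the smallest positive gap.
    """
    V = Q.max(axis=1, keepdims=True)  # V*(s) = max_a Q*(s, a)
    return V - Q
```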
Unifying PAC and Regret: Uniform PAC Bounds for Episodic Reinforcement Learning
Statistical performance bounds for reinforcement learning (RL) algorithms can be critical for high-stakes applications like healthcare. This paper introduces a new framework for theoretically measuring the performance of such algorithms called Uniform-PAC, which is a strengthening of the classical Probably Approximately Correct (PAC) framework. In contrast to the PAC framework, the uniform version may be used to derive high probability regret guarantees and so forms a bridge between the two setups that has been missing in the literature. We demonstrate the benefits of the new framework for finite-state episodic MDPs with a new algorithm that is Uniform-PAC and simultaneously achieves optimal regret and PAC guarantees except for a factor of the horizon.
Review for NeurIPS paper: Steady State Analysis of Episodic Reinforcement Learning
Clarity: In my opinion the main weakness of the paper is its presentation. First, there is a lack of clear, direct, explanations of what the paper is trying to accomplish. Several crucial points are either only implied or mentioned in passing without the proper emphasis. This is true for the positioning of the paper itself. The analysis seems to be mostly concerned with policy gradient methods, but this is never explicitly stated.
Review for NeurIPS paper: Steady State Analysis of Episodic Reinforcement Learning
This paper provides a new perspective on thinking about episodic RL, and should be of interest to anyone working with MDPs in reinforcement learning. Three reviewers (R1, R2, R3) commented that it was well-written and clear, although R4 disagreed. All reviewers commented on the interesting contributions (proving that the MDPs underlying episodic RL tasks are ergodic). R1, R2, and R3 had concerns that it was a mostly theoretical paper, and wondered how to practically apply these insights. However, the rebuttal goes some way toward addressing these points, and R4 was convinced to raise their recommendation to weak accept.
Review for NeurIPS paper: A Unifying View of Optimism in Episodic Reinforcement Learning
Weaknesses: While I like the duality result, I find this paper is not substantial enough to merit acceptance. This paper shows that a class of model-optimistic algorithms can be implemented efficiently (with minor modifications). However, none of the state-of-the-art algorithms is model-optimistic. This is somewhat inherent to this class of algorithms, because the transition model scales with S^2, whereas the optimal bounds scale linearly in S via value-optimistic algorithms. Value-optimistic algorithms are not only more computationally efficient but also more statistically efficient. So making model-optimistic algorithms more efficient is not a very significant result.
Review for NeurIPS paper: A Unifying View of Optimism in Episodic Reinforcement Learning
The reviewers are in agreement that this is interesting and well-presented work. The main concern was about the extent to which the results will help us derive SOTA algorithms in the future. I find the contribution reasonable without this and hope the community will figure out how/if these results are useful. Please do take the reviewers' minor suggestions into consideration when preparing a final version.